K Means Clustering Project

Usually when dealing with an unsupervised learning problem, it's difficult to get a good measure of how well the model performed. For this project, we will use data from the UCI archive on red and white wines (a very commonly used data set in ML).

We will then add a label to the combined data set. We'll bring this label back later to see how well we can cluster the wine into groups.

Get the Data

Download the two CSV data files from the UCI repository (or just use the already downloaded CSV files).

Use read.csv to open both data sets and set them as df1 and df2. Pay attention to what the separator (sep) is.
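As a hedged illustration of why the separator matters: the UCI wine files are semicolon-separated, so the default `sep = ","` reads each line as a single column. The sketch below uses a small inline sample (three columns, values taken from the first rows shown in this notebook) instead of the real files. The file names in the trailing comments are assumptions about how the downloads are saved locally.

```r
# Inline sample in the same semicolon-separated layout as the UCI files.
sample_csv <- "fixed acidity;volatile acidity;citric acid
7.4;0.7;0
7.8;0.88;0"

# Default separator (comma): each line is read as one single column.
wrong <- read.csv(text = sample_csv)

# Semicolon separator: the three columns are split as intended.
right <- read.csv(text = sample_csv, sep = ";")

print(ncol(wrong))
print(ncol(right))
print(names(right))  # spaces in the header are converted to dots

# With the real files (paths are assumptions), the calls would look like:
# df1 <- read.csv("winequality-red.csv", sep = ";")
# df2 <- read.csv("winequality-white.csv", sep = ";")
```

The dotted column names seen throughout this notebook (for example `fixed.acidity`) come from `read.csv` sanitising headers that contain spaces.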

In [10]:

Now add a label column to both df1 and df2 indicating a label 'red' or 'white'.
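One possible way to add the label, sketched on hypothetical two-row stand-ins for df1 and df2 (the real data frames have twelve numeric columns; only two are mimicked here):

```r
# Hypothetical miniature versions of df1 (red) and df2 (white).
df1 <- data.frame(alcohol = c(9.4, 9.8), quality = c(5L, 5L))
df2 <- data.frame(alcohol = c(8.8, 9.5), quality = c(6L, 6L))

# Assigning a single string recycles it across every row of the data frame.
df1$label <- "red"
df2$label <- "white"

print(df1)
print(df2)
```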

In [11]:

Check the head of df1 and df2.

In [12]:
Out[12]:
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1           7.4              0.7           0            1.9     0.076
2           7.8             0.88           0            2.6     0.098
3           7.8             0.76        0.04            2.3     0.092
4          11.2             0.28        0.56            1.9     0.075
5           7.4              0.7           0            1.9     0.076
6           7.4             0.66           0            1.8     0.075
  free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
1                  11                   34  0.9978 3.51      0.56
2                  25                   67  0.9968  3.2      0.68
3                  15                   54   0.997 3.26      0.65
4                  17                   60   0.998 3.16      0.58
5                  11                   34  0.9978 3.51      0.56
6                  13                   40  0.9978 3.51      0.56
  alcohol quality label
1     9.4       5   red
2     9.8       5   red
3     9.8       5   red
4     9.8       6   red
5     9.4       5   red
6     9.4       5   red
In [13]:
Out[13]:
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1             7             0.27        0.36           20.7     0.045
2           6.3              0.3        0.34            1.6     0.049
3           8.1             0.28         0.4            6.9      0.05
4           7.2             0.23        0.32            8.5     0.058
5           7.2             0.23        0.32            8.5     0.058
6           8.1             0.28         0.4            6.9      0.05
  free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
1                  45                  170   1.001    3      0.45
2                  14                  132   0.994  3.3      0.49
3                  30                   97  0.9951 3.26      0.44
4                  47                  186  0.9956 3.19       0.4
5                  47                  186  0.9956 3.19       0.4
6                  30                   97  0.9951 3.26      0.44
  alcohol quality label
1     8.8       6 white
2     9.5       6 white
3    10.1       6 white
4     9.9       6 white
5     9.9       6 white
6    10.1       6 white

Combine df1 and df2 into a single data frame called wine.
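Because both data frames share the same column names, stacking them row-wise with `rbind` is one straightforward option. The sketch below uses hypothetical two-row stand-ins for the labelled df1 and df2; on the real data the result should have 6497 rows, matching the `str(wine)` output in this notebook.

```r
# Hypothetical miniature versions of the labelled df1 and df2.
df1 <- data.frame(alcohol = c(9.4, 9.8), quality = c(5L, 5L), label = "red")
df2 <- data.frame(alcohol = c(8.8, 9.5), quality = c(6L, 6L), label = "white")

# rbind matches columns by name and appends the rows of df2 below df1.
wine <- rbind(df1, df2)

print(dim(wine))
print(table(wine$label))
```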

In [14]:
In [15]:
str(wine)
'data.frame':	6497 obs. of  13 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
 $ label               : chr  "red" "red" "red" "red" ...

EDA

Let's explore the data a bit and practice our ggplot2 skills!

Create a histogram of residual.sugar from the wine data. Color by red and white wines.
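A minimal ggplot2 sketch of such a histogram, assuming the ggplot2 package is installed. The data frame here is a randomly generated stand-in rather than the real wine data, and the bin count and fill colors are illustrative choices, not values specified by this project.

```r
library(ggplot2)

# Hypothetical stand-in for the combined wine data frame.
set.seed(101)
wine <- data.frame(
  residual.sugar = c(rexp(50, rate = 1 / 2.5), rexp(150, rate = 1 / 6)),
  label = rep(c("red", "white"), times = c(50, 150))
)

# Map the label to the fill aesthetic so each wine type gets its own color.
pl <- ggplot(wine, aes(x = residual.sugar, fill = label)) +
  geom_histogram(color = "black", bins = 50) +
  scale_fill_manual(values = c(red = "#ae4554", white = "#faf7ea")) +
  theme_bw()

print(pl)
```

The same pattern applies to the other histograms in this section by swapping the column mapped to `x`.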

In [16]:
In [37]:

Create a histogram of citric.acid from the wine data. Color by red and white wines.

In [39]:

Create a histogram of alcohol from the wine data. Color by red and white wines.

In [40]:

Create a scatterplot of residual.sugar versus citric.acid, color by red and white wine.
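A hedged sketch of the scatterplot, again on randomly generated stand-in data and assuming ggplot2 is installed. Placing residual.sugar on the y axis and citric.acid on the x axis is one reading of "versus"; the transparency, colors and dark theme are illustrative choices so the pale "white" points stay visible.

```r
library(ggplot2)

# Hypothetical stand-in for the combined wine data frame.
set.seed(101)
wine <- data.frame(
  citric.acid = runif(200, min = 0, max = 1),
  residual.sugar = rexp(200, rate = 1 / 5),
  label = rep(c("red", "white"), times = c(50, 150))
)

# Color the points by wine type; alpha reduces overplotting.
pl <- ggplot(wine, aes(x = citric.acid, y = residual.sugar, color = label)) +
  geom_point(alpha = 0.4) +
  scale_color_manual(values = c(red = "#ae4554", white = "#faf7ea")) +
  theme_dark()

print(pl)
```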

In [49]:

Create a scatterplot of volatile.acidity versus residual.sugar, color by red and white wine.

In [52]:

Feel free to explore the data as you see fit. We'll go ahead and move on!

Grab the wine data without the label and call it clus.data.
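The label has to be dropped because `kmeans` works on numeric data only, and because the point of the exercise is to cluster without it. One way to do this, sketched on a hypothetical three-row stand-in for wine:

```r
# Hypothetical miniature version of the combined wine data frame.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 8.8),
  quality = c(5L, 5L, 6L),
  label = c("red", "red", "white")
)

# Keep every column except the one named "label".
clus.data <- wine[, names(wine) != "label"]

print(names(clus.data))
```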

In [65]:

Check the head of clus.data.

In [63]:
Out[63]:
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1           7.4              0.7           0            1.9     0.076
2           7.8             0.88           0            2.6     0.098
3           7.8             0.76        0.04            2.3     0.092
4          11.2             0.28        0.56            1.9     0.075
5           7.4              0.7           0            1.9     0.076
6           7.4             0.66           0            1.8     0.075
  free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
1                  11                   34  0.9978 3.51      0.56
2                  25                   67  0.9968  3.2      0.68
3                  15                   54   0.997 3.26      0.65
4                  17                   60   0.998 3.16      0.58
5                  11                   34  0.9978 3.51      0.56
6                  13                   40  0.9978 3.51      0.56
  alcohol quality label
1     9.4       5   red
2     9.8       5   red
3     9.8       5   red
4     9.8       6   red
5     9.4       5   red
6     9.4       5   red

Building the Clusters

Call the kmeans function on clus.data and assign the results to wine.cluster.
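A minimal sketch of the call, run on hypothetical synthetic data rather than the wine features. Using `centers = 2` is an assumption that matches the two wine types and the two rows of cluster means shown in this notebook; `set.seed` and `nstart = 20` are optional additions to make the random initialisation reproducible and more stable.

```r
set.seed(101)

# Hypothetical numeric data: two well-separated groups of 50 points each.
clus.data <- data.frame(
  x = c(rnorm(50, mean = 0), rnorm(50, mean = 10)),
  y = c(rnorm(50, mean = 0), rnorm(50, mean = 10))
)

# centers = 2 asks for two clusters; nstart = 20 tries 20 random starts
# and keeps the solution with the lowest within-cluster sum of squares.
wine.cluster <- kmeans(clus.data, centers = 2, nstart = 20)

print(wine.cluster$centers)  # one row of feature means per cluster
print(wine.cluster$size)     # number of points assigned to each cluster
```

The fitted object also carries `wine.cluster$cluster`, an integer vector with the cluster assigned to each row, which is what gets compared against the labels later.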

In [74]:

Print out the wine.cluster Cluster Means and explore the information.

In [76]:
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1      7.619044        0.4079451   0.2911080       3.082690 0.0656846
2      6.904698        0.2871364   0.3398094       7.259286 0.0486092
  free.sulfur.dioxide total.sulfur.dioxide   density       pH sulphates
1            18.43735             63.54832 0.9945680 3.255147 0.5718655
2            39.82503            155.90101 0.9947956 3.190308 0.5000354
   alcohol  quality
1 10.79529 5.809204
2 10.25832 5.825436

Evaluating the Clusters

You usually won't have the luxury of labeled data with K Means, but let's go ahead and see how we did!

Use the table() function to compare your cluster results to the real results. Which is easier to correctly group, red or white wines?

In [85]:
Out[85]:
       
           1    2
  red   1515   84
  white 1310 3588
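To read this table more easily, the counts can be converted to row-wise shares. The sketch below re-enters the counts from the output above by hand; on the real objects, a call such as `table(wine$label, wine.cluster$cluster)` would produce the table directly (the exact cluster numbering can differ between runs).

```r
# Counts copied from the table() output above
# (rows: true label, columns: assigned cluster).
tab <- matrix(
  c(1515, 84,
    1310, 3588),
  nrow = 2, byrow = TRUE,
  dimnames = list(label = c("red", "white"), cluster = c("1", "2"))
)

# Share of each true label that falls in each cluster (each row sums to 1).
row.share <- prop.table(tab, margin = 1)
print(round(row.share, 3))

# Overall agreement if cluster 1 is read as "red" and cluster 2 as "white".
overall <- sum(diag(tab)) / sum(tab)
print(round(overall, 3))
```

On these counts, about 95% of the red wines land in cluster 1, while about 73% of the white wines land in cluster 2, so the reds are grouped more cleanly and overall agreement is roughly 79%.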

Great Job!